library(lubridate)
library(knitr)
library(viridis)
library(lme4)
library(lmerTest)
library(kableExtra)
library(MLmetrics)
library(caret)
library(tseries)
library(forecast)
#library(tidymodels)
suppressPackageStartupMessages(library(tidyverse))

Setup

set.seed(1)

Read in the data

df <- read_tsv("../data/Question2.tsv")

Get a sense of data size/shape

print(dim(df))
## [1] 981   3
kbl(head(df), 'html', align = 'r') %>% kable_styling()
Date Cost Revenue
2019-08-20 $4,214.69 $14,349.50
2019-08-21 $2,627.13 $13,463.13
2019-08-22 $3,196.08 $10,297.64
2019-08-23 $1,134.45 $6,296.62
2019-08-24 $3,207.85 $10,291.51
2019-08-25 $7,325.59 $24,733.41

Quality Control

Check for NA and duplicates

df %>%
  summarise(across(everything(), list(NAs = ~sum(is.na(.)))),
            duplicated.days = sum(duplicated(Date))) %>%
  kbl('html') %>% kable_styling()
Date_NAs Cost_NAs Revenue_NAs duplicated.days
0 0 0 1

There is one duplicated day, and there are no NAs.

df %>%
  filter(Date == df$Date[duplicated(df$Date)]) %>%
  kbl('html') %>% kable_styling()
Date Cost Revenue
2021-09-03 $5,432.35 $20,120.06
2021-09-03 $5,432.35 $20,120.06

The table above shows that Cost and Revenue are identical for the duplicated date, so we can safely delete one instance.

df = df %>% filter(!duplicated(Date))
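As a minimal base-R illustration (toy data, not the actual file), `duplicated()` flags every occurrence of a key after the first, so filtering on `!duplicated(Date)` keeps the first row and drops only the repeat; `dplyr::distinct()` would behave the same here:

```r
# Toy frame with one repeated date; !duplicated() keeps first occurrences
d <- data.frame(Date = c("2021-09-02", "2021-09-03", "2021-09-03"),
                Cost = c(100, 200, 200))
d[!duplicated(d$Date), ]  # drops only the second "2021-09-03" row
```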

Data wrangling, data types, re-coding, and feature engineering

  1. We’ll convert Cost and Revenue, which read in as character/string data, to numeric values
  2. For convenience, we’ll extract day, month, and year, as well as “year day” and “month day” from the Date column to aid in visualization and possibly modeling.
  3. Compute a Profit variable, estimated as \(Revenue - Cost\)
df1 = df %>%
  mutate(
    # Convert the Cost and Revenue columns from character strings of the
    # format "$<comma-separated dollars>.<two-digit cents>" to numeric
    across(c(Cost, Revenue), ~as.numeric(gsub('[$,]', '', .))),
    # From the date column, it may be handy to have the day, the month, year,
    # the day within year (from 0 - 365 excluding leap years), day w/in month
    day = day(Date), month = month(Date), year = factor(year(Date)), 
    yday = yday(Date), mday=mday(Date), # day w/in year and month
    # For visualization it can be helpful to have just the month-day (no year)
    month_and_day = as.factor(sub('.{5}', '', Date)),
    # Code the variable profit, estimated by Revenue - cost
    #Profit = Revenue - Cost, efficiency = Revenue / Cost)
    Profit = Revenue - Cost)
head(df1) %>%
  kbl('html') %>% kable_styling()
Date Cost Revenue day month year yday mday month_and_day Profit
2019-08-20 4214.69 14349.50 20 8 2019 232 20 08-20 10134.81
2019-08-21 2627.13 13463.13 21 8 2019 233 21 08-21 10836.00
2019-08-22 3196.08 10297.64 22 8 2019 234 22 08-22 7101.56
2019-08-23 1134.45 6296.62 23 8 2019 235 23 08-23 5162.17
2019-08-24 3207.85 10291.51 24 8 2019 236 24 08-24 7083.66
2019-08-25 7325.59 24733.41 25 8 2019 237 25 08-25 17407.82
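As a quick sanity check on the currency conversion (a standalone base-R sketch; `parse_dollars` is a hypothetical helper, not part of the pipeline), the same strip-and-parse logic can be verified on sample strings:

```r
# Strip '$' and ',' then parse as numeric, matching the across() conversion
parse_dollars <- function(x) as.numeric(gsub("[$,]", "", x))

parse_dollars(c("$4,214.69", "$14,349.50"))  # 4214.69 14349.50
```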

Exploratory and Graphical Data Analysis

View the distribution of our response Revenue

ggplot(df1) +
  geom_density(aes(x = Revenue, fill = year), color='black', alpha = .5, linewidth=.5) +
  theme_minimal() +
  scale_fill_viridis_d() +
  facet_grid(rows = vars(year))

This variable shows a high degree of positive skew that is pretty consistent across years. Let’s view it on a log scale:

ggplot(df1) +
  geom_density(aes(x = Revenue, fill = year), color='black', alpha = .5, linewidth=.5) +
  theme_minimal() +
  scale_fill_viridis_d() +
  facet_grid(rows = vars(year)) +
  scale_x_log10()

Transforming the axes (implicitly visualizing log(Revenue)) reveals a trend across years and also normalizes the data fairly well. The overall distribution after log transformation:

df1 %>%
  ggplot +
  geom_density(aes(x = log(Revenue)), 
               fill='skyblue', color='black', alpha = .5, linewidth=.5) +
  theme_minimal() 
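The symmetrizing effect of the log can be illustrated on simulated lognormal data (the simulation parameters are arbitrary, chosen only to mimic right skew; this is not the actual Revenue series):

```r
set.seed(1)
x <- rlnorm(10000, meanlog = 9.5, sdlog = 0.5)  # right-skewed, like raw Revenue
# Raw scale: the long right tail drags the mean above the median
mean(x) > median(x)
# Log scale: mean and median nearly coincide (approximate symmetry)
abs(mean(log(x)) - median(log(x))) < 0.02
```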

Visualize yearly time series of log(Revenue) and log(Profit)

# Format data for plotting
plt.dat = df1 %>%
  mutate(`log(Revenue)` = log(Revenue), `log(Profit)` = log(Profit))
# plot of: Revenue (log transformed)
plt.dat %>% 
  ggplot +
  geom_line(aes(x = yday, y = `log(Revenue)`, color = year, group = year)) +
  # Label just the first day of each month:
  scale_x_continuous(
    name = 'Month',
    breaks = as.numeric(plt.dat$yday)[(plt.dat$mday == 1)&(plt.dat$year==2021)],
    labels = month(1:12, label = T)
  )  +
  # set relevant y-axis
  scale_y_continuous(limits = c(8,12.25)) +
  theme_minimal()

# plot of: Profit = Revenue - Cost (log transformed)
plt.dat %>% 
  ggplot +
  geom_line(aes(x = yday, y = `log(Profit)`, color = year, group = year)) +
  # Label just the first day of each month:
  scale_x_continuous(
    name = 'Month',
    breaks = as.numeric(plt.dat$yday)[(plt.dat$mday == 1)&(plt.dat$year==2021)],
    labels = month(1:12, label = T)
  )  +
  # set relevant y-axis
  scale_y_continuous(limits = c(8,12.25)) +
  theme_minimal()

Conclusions from plot of \(\log(Revenue)\)

  1. Among years with complete data, 2021 shows the most revenue, and there appears to be a yearly trend such that 2019 < 2020 < 2021 < 2022.

  2. Revenue (in log units) increases approximately monotonically from January through the holiday season, with each month higher on average than the last; however, a sharp decline beginning in mid-December breaks this pattern.

  3. 2020 is an outlier relative to the other years: it shows much more variability in the low-frequency trend, with surprisingly high revenue from January through March, followed by a large drop and then a big rebound. The dip and rebound may reflect the start of the COVID-19 pandemic, or may simply be noise.

Conclusions from plot of \(\log(Profit)\)

  1. Profit is tightly correlated with Revenue and is largely redundant in this case.

Relationships between Cost and other variables

(log-)Linear relationship between Cost and Revenue

ggplot(df1, aes(x = log(Cost), y = log(Revenue))) +
  geom_point() +
  geom_smooth(method = 'lm', formula = 'y~x') +
  facet_wrap(~year,scales = 'fixed') +
  theme_minimal()

This graph shows that despite the long positive skew in the distributions of Cost and Revenue, the relationship between them is reliable even at extreme values of both. There is a strong correlation between Cost and Revenue, and the relationship between log(Cost) and log(Revenue) appears linear, so it seems appropriate to model log(Revenue) going forward. Finally, this trend is consistent across years, although, based on admittedly fewer data points, Cost appears less effective at generating Revenue in 2022.

Dynamic relationship between Cost and Revenue

df1 %>%
  pivot_longer(cols = c('Revenue', 'Cost'), names_to = 'name', values_to = 'value') %>%
  ggplot +
  geom_line(aes(yday, y = log(value), color = name, group = name)) +
  scale_x_continuous(
    name = 'Month',
    breaks = as.numeric(df1$yday)[(df1$mday == 1)&(df1$year==2021)],
    labels = month(1:12, label = T),
  ) +
  scale_y_continuous(limits = c(6.5,12.25)) +
  facet_grid(rows = vars(year)) +
  theme_bw()

Correlation between Cost and Revenue

(ct = cor.test(df1$Cost, df1$Revenue, method = 'spearman'))
## 
##  Spearman's rank correlation rho
## 
## data:  df1$Cost and df1$Revenue
## S = 24072006, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8465433

The dynamic coupling between Cost and Revenue is significant, \(\rho_{spearman}\) = 0.85, p < .0001. Furthermore, Cost seems to predict both short-term fluctuations and long-term trends in Revenue, to the point that explicit seasonal terms may be redundant.
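To probe whether Cost leads or lags Revenue rather than moving with it, `stats::ccf` could be applied; here is a sketch on simulated coupled series (not the actual data):

```r
set.seed(1)
x <- rnorm(200)                    # stand-in for daily Cost
y <- 2 * x + rnorm(200, sd = 0.5)  # stand-in Revenue, coupled at lag 0
cc <- ccf(x, y, lag.max = 5, plot = FALSE)
cc$lag[which.max(cc$acf)]  # lag of the strongest cross-correlation (here 0)
```

A peak at lag 0 indicates same-day coupling; a peak at a negative or positive lag would suggest one series systematically leads the other.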

Assess the seasonality of Revenue

df1 %>%
  ggplot +
  geom_line(aes(x = Date, y = log(Revenue))) +
  theme_bw()

The plot suggests a yearly seasonal trend that resets in mid-December each year.

Apply the augmented Dickey-Fuller test to assess stationarity

adf.test(df1$Revenue)
## Warning in adf.test(df1$Revenue): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  df1$Revenue
## Dickey-Fuller = -4.1153, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary

This suggests the time series is stationary and can be appropriately modeled using an ARIMA-like model (ARIMA or ARIMAX)

Conclusions from EDA and graphical analysis

  1. Cost and Revenue should be log-transformed: doing so makes them approximately normal, which lets us model mean Revenue more reliably
  2. While yearly trends exist (different years center around different means) and within-year monthly trends are fairly consistent (i.e., increasing from January to December), these trends are likely confounded with Cost
  3. Given the strong dynamic correlation between Cost and Revenue, the data should be modeled using time-series methods
  4. The best-fitting model might be a seasonal ARIMA (“SARIMA”), though an ARIMA model with Cost as an exogenous predictor may fit just as well

Engineer data for modeling

Rescale and recode variables as needed

df2 = df1 %>%
  # Log transform Revenue and Cost
  mutate(log.Revenue = log(Revenue),
         log.Cost = log(Cost))
# Split December into two parts, before and after the peak (for that year)
for (y in unique(df2$year)) {
  dec.data = df2 %>% filter(month == 12, year == y)
  if (nrow(dec.data) > 0) { # if this year has december data
    dec.data[-1:-which.max(dec.data$Cost), 'month'] = 13
    df2[(df2$year == y) & (df2$month == 12), 'month'] = dec.data$month
  }
}
# Store month as an ordered factor
df2 = df2 %>% mutate(across(month, ordered))
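The December-splitting loop above can be illustrated on a toy month (made-up values; the peak day is found with `which.max` exactly as in the loop):

```r
# Toy December: Cost peaks on day 4; days after the peak get pseudo-month 13
dec <- data.frame(day = 1:6, Cost = c(10, 20, 35, 50, 30, 15), month = 12)
dec[-1:-which.max(dec$Cost), "month"] <- 13
dec$month  # 12 12 12 12 13 13
```

Negative indexing with `-1:-which.max(...)` drops the rows up to and including the peak, so only the post-peak days are recoded.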

Convert the data to a timeseries object

df3 = df2 %>%
  # Keep raw and log-scaled versions of Cost and Revenue for modeling
  select(log.Revenue, log.Cost, Revenue, Cost) %>%
  ts # convert to time series using ts()
head(df3)
## Time Series:
## Start = 1 
## End = 6 
## Frequency = 1 
##   log.Revenue log.Cost  Revenue    Cost
## 1    9.571470 8.346331 14349.50 4214.69
## 2    9.507710 7.873647 13463.13 2627.13
## 3    9.239670 8.069680 10297.64 3196.08
## 4    8.747768 7.033903  6296.62 1134.45
## 5    9.239075 8.073356 10291.51 3207.85
## 6   10.115910 8.899129 24733.41 7325.59

Fit a regression model with ARIMA errors (ARIMAX)

Note: the first model I fit set stationary = TRUE, which forces a stationary (undifferenced) ARMA fit, but its residuals were autocorrelated. Allowing differencing satisfied the independent-residuals assumption better. Finally, evidence for independence of residuals was strongest when using raw Revenue and Cost rather than their log transformations.

mod <- auto.arima(df3[, "Revenue"], xreg = df3[, "Cost"], stationary = FALSE)
# Check model fit
mod
## Series: df3[, "Revenue"] 
## Regression with ARIMA(1,1,2) errors 
## 
## Coefficients:
##           ar1     ma1      ma2    xreg
##       -0.6280  0.0417  -0.4760  3.7673
## s.e.   0.1358  0.1291   0.0758  0.0744
## 
## sigma^2 = 32821109:  log likelihood = -9858.98
## AIC=19727.96   AICc=19728.02   BIC=19752.39

The best-fitting model suggests autocorrelation and non-stationarity (one order of differencing), as well as significant explanatory/predictive value of Cost.
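The d = 1 in ARIMA(1,1,2) means the model works on first differences; a base-R sketch of why differencing removes a stochastic trend:

```r
set.seed(1)
eps  <- rnorm(500)   # stationary white-noise increments
walk <- cumsum(eps)  # non-stationary random walk (stochastic trend)
# First differencing recovers the stationary increments
isTRUE(all.equal(diff(walk), eps[-1]))
```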

Visualize fit

plt.dat %>% 
  mutate(predicted = mod$fitted) %>%
  pivot_longer(cols = c('Revenue', 'predicted'), names_to = 'name', values_to = 'value') %>%
  ggplot +
  geom_line(aes(yday, y = log(value), color = name, group = name)) +
  scale_x_continuous(
    name = 'Month',
    breaks = as.numeric(df1$yday)[(df1$mday == 1)&(df1$year==2021)],
    labels = month(1:12, label = T),
  ) +
  scale_y_continuous(limits = c(6.5,12.25)) +
  facet_grid(rows = vars(year)) +
  theme_bw()

The model appears to fit very well with just Cost as a predictor.

Diagnostics

checkresiduals(mod)

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(1,1,2) errors
## Q* = 11.67, df = 7, p-value = 0.1119
## 
## Model df: 3.   Total lags used: 10

The Ljung-Box test is non-significant (p = 0.11), so the residuals are consistent with white noise and the model assumptions appear adequate.

Question 1

The first day of the week we spend $17,007.92, how would you allocate the remaining budget throughout the week to try to maximize the amount of Revenue that we get for that channel?

Generate new data to reflect different expenditure patterns

budget = 116000 - 17007.92
new_data = tibble(
  uniform = rep(1/6, times = 6),
  increasing = c(.05, .05, .1, .15, .3, .35),
  decreasing = rev(increasing),
  max1 = ifelse(1:6 == 1, .80, (1 - .80)/5),
  max2 = ifelse(1:6 == 2, .80, (1 - .80)/5), # one day takes majority of expenditure
  max3 = ifelse(1:6 == 3, .80, (1 - .80)/5),
  max4 = ifelse(1:6 == 4, .80, (1 - .80)/5),
  max5 = ifelse(1:6 == 5, .80, (1 - .80)/5),
  max6 = ifelse(1:6 == 6, .80, (1 - .80)/5)
)
all(sapply(new_data, sum) == 1) # each scenario's proportions should sum to 1
## [1] TRUE
# Different expenditure patterns:
(new_data = new_data %>% mutate(across(everything(), ~.*budget)))
## # A tibble: 6 × 9
##   uniform increasing decreasing   max1   max2   max3   max4   max5   max6
##     <dbl>      <dbl>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1  16499.      4950.     34647. 79194.  3960.  3960.  3960.  3960.  3960.
## 2  16499.      4950.     29698.  3960. 79194.  3960.  3960.  3960.  3960.
## 3  16499.      9899.     14849.  3960.  3960. 79194.  3960.  3960.  3960.
## 4  16499.     14849.      9899.  3960.  3960.  3960. 79194.  3960.  3960.
## 5  16499.     29698.      4950.  3960.  3960.  3960.  3960. 79194.  3960.
## 6  16499.     34647.      4950.  3960.  3960.  3960.  3960.  3960. 79194.
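Each scenario above is a vector of proportions scaled by the remaining budget; a quick standalone check (numbers copied from the chunk above) confirms that an allocation exhausts the budget exactly:

```r
budget <- 116000 - 17007.92
increasing <- c(.05, .05, .10, .15, .30, .35)  # proportions summing to 1
alloc <- increasing * budget
isTRUE(all.equal(sum(alloc), budget))  # allocation spends the full remaining budget
```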

Predict revenue for each scenario

Plots show the forecasted predictions, and provide the cumulative revenue for the forecasted week.

predictions = map(new_data, ~{forecast(mod, xreg = .x)})
map(1:length(predictions), ~{
  p = predictions[[.x]]
  cond = names(new_data)[.x]
  revenue = round(sum(as.vector(p$mean)), 2)
  plt.t = paste0('Total Forecasted Revenue from 4/26/2022 to 5/1/2022\n',
                 'condition: ', cond, '; revenue: ', revenue)
  autoplot(p) + 
    coord_cartesian(xlim = c(nrow(df3) - 90, nrow(df3) + 6)) +
    ggtitle(plt.t) +
    theme_minimal()
  })
(The nine forecast plots, one per expenditure scenario, appear here; each shows the last ~90 days of observed Revenue plus the six forecasted days, with the scenario name and total forecasted revenue in the title.)

Most profitable day

lapply(predictions[-1:-3], function(x) x$mean) %>%
  as.data.frame %>%
  as.matrix %>%
  diag %>%
  {data.frame(Date = tail(df2$Date,1) + 1:6, Revenue = .)} %>%
  arrange(desc(Revenue)) %>%
  kbl('html') %>% kable_styling()
Date Revenue
2022-04-26 280490.9
2022-04-28 280091.3
2022-04-30 279933.8
2022-05-01 279766.7
2022-04-29 279667.7
2022-04-27 279416.8

The total revenue predictions are nearly identical across scenarios because the effect size of Cost in the model is small relative to the scale of Revenue. However, a closer inspection reveals that, holding all else constant, spending on this channel on 4/26/2022 leads to the most Revenue.
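The `diag()` trick above pairs scenario i with day i: column i of the prediction matrix holds the forecasts under “spend big on day i”, so the diagonal reads off each day's forecast under its own max-spend scenario. A toy illustration:

```r
# Toy 3-day, 3-scenario forecast matrix; entry [d, s] = forecast for day d
# under scenario s. The diagonal gives day d under its own scenario d.
m <- matrix(1:9, nrow = 3)
diag(m)  # 1 5 9
```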

minimum_cost = 100
# Increase spend in $10 steps until the one-step-ahead forecasted revenue is positive
while (TRUE) {
  revenue = forecast(mod, xreg = minimum_cost)$mean
  if (revenue > 0) break else minimum_cost = minimum_cost + 10
}
data.frame(minimum_cost = minimum_cost, revenue = revenue)
##   minimum_cost  revenue
## 1         4740 1.837349

In addition, predicted revenue is negative unless Cost reaches at least $4,740. As such, it would be inadvisable to spend any less than $4,740 in this channel on our most advantageous day (4/26/2022).

Multilevel modeling approach

Viewing the four within-year scatterplots of Cost against Revenue suggests that each year has its own baseline. The slope of the regression line predicting Revenue from Cost also seems to vary by year, suggesting a possible interaction. In addition, to account for the historical data obtained for each month, we can give each month a random intercept. These aspects of the data can be modeled using multilevel models, with random effects for the timing variables (e.g., year or month), in addition to the effect of Cost.

mem1 = lmer(log.Revenue ~ (1|month) + log(Cost)*year, data = df2)
mem2 = lmer(log.Revenue ~ (1|year) + log(Cost), data = df2)
summary(mem1)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: log.Revenue ~ (1 | month) + log(Cost) * year
##    Data: df2
## 
## REML criterion at convergence: 131.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2849 -0.5698  0.0497  0.6316  4.0875 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  month    (Intercept) 0.01661  0.1289  
##  Residual             0.06219  0.2494  
## Number of obs: 980, groups:  month, 13
## 
## Fixed effects:
##                     Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)          2.29297    0.33859 961.70169   6.772 2.21e-11 ***
## log(Cost)            0.92784    0.04180 971.99989  22.196  < 2e-16 ***
## year2020             2.17702    0.38600 971.99229   5.640 2.23e-08 ***
## year2021            -0.41698    0.39446 971.26037  -1.057    0.291    
## year2022             2.53502    0.51455 971.98987   4.927 9.83e-07 ***
## log(Cost):year2020  -0.25612    0.04740 971.61705  -5.403 8.25e-08 ***
## log(Cost):year2021   0.03524    0.04754 971.92870   0.741    0.459    
## log(Cost):year2022  -0.30298    0.06101 971.96645  -4.966 8.06e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) lg(Cs) yr2020 yr2021 yr2022 l(C):2020 l(C):2021
## log(Cost)   -0.992                                                
## year2020    -0.818  0.809                                         
## year2021    -0.803  0.794  0.788                                  
## year2022    -0.640  0.639  0.570  0.580                           
## lg(Cs):2020  0.826 -0.822 -0.997 -0.784 -0.573                    
## lg(Cs):2021  0.829 -0.826 -0.793 -0.996 -0.592  0.795             
## lg(Cs):2022  0.672 -0.676 -0.578 -0.584 -0.995  0.585     0.602
summary(mem2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: log.Revenue ~ (1 | year) + log(Cost)
##    Data: df2
## 
## REML criterion at convergence: 383.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4195 -0.6190 -0.0709  0.6385  3.5354 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  year     (Intercept) 0.004903 0.07002 
##  Residual             0.084863 0.29131 
## Number of obs: 980, groups:  year, 4
## 
## Fixed effects:
##              Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)   3.15923    0.13081 282.33818   24.15   <2e-16 ***
## log(Cost)     0.82103    0.01505 935.79499   54.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr)
## log(Cost) -0.960

While these models do show significant effects of year and month, there is not much evidence for treating these grouping factors as random effects, given the low between-month and between-year variance relative to the residual variance. This suggests we should stick with the ARIMA model.
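The “low variability” claim can be quantified as an intraclass correlation using the variance components printed for mem2 above (year variance 0.004903, residual 0.084863):

```r
# ICC: share of log-Revenue variance attributable to year in mem2
var_year  <- 0.004903
var_resid <- 0.084863
icc <- var_year / (var_year + var_resid)
round(icc, 3)  # ~0.055: year explains only about 5% of the variance
```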

Question 1 Answer:

Based on the model forecasts for the week following April 25th, 2022, I would suggest that which days the money is spent on matters much less than how much is spent on this channel overall. To that end, I would suggest spending whatever is budgeted for this week. There is minor evidence that allocating expenditures to 4/26 and 4/28 is most advantageous, followed by 4/30, 5/1, 4/29, and 4/27; however, the apparent differences in expected revenue across days may be due to noise in the model and extrapolation error, so this latter suggestion should be treated with caution.

Question 2

How much Revenue do you predict that we will make in that channel for the given week with the given budget?

Question 2 Answer:

Based on the first model, I predict the week will return $17,007.92 for the first day plus $262,223.33 for the remaining days, totaling $279,231.25.

Next steps and limitations

One important limitation is that these models were not cross-validated, so their predictions might be inflated by overfitting. Since we have extensive, dense historical data, this may not be a huge concern for our ultimate predictions of Revenue. However, cross-validation is a useful tool for model comparison; with it, we could assess whether the predictions from the multilevel models are inflated relative to the ARIMA model.